Authorship Attribution in Modern Hebrew In partial fulfillment of requirements for

Author

  • David Gabay
Abstract

This thesis deals with a text classification problem: identifying the author of a text by its style. Given a text whose author is unknown and a set of candidate authors with sample texts, the task is to find the true author. Authorship attribution has applications in the humanities, in forensic linguistics and in intelligence. The corpora used in this study are written in Modern Hebrew. Hebrew presents many challenges for this task: complex morphology, a high level of ambiguity and a general lack of processing tools. The main corpus consists of blog posts. Its texts are shorter than in most authorship attribution studies: most are a few hundred words long, and the shortest only a few dozen words long. A secondary corpus consists of literary works from early Modern Hebrew literature. Each corpus contains texts by nine different authors.

Two general approaches were taken. The first treats a text as a sequence of characters and uses general methods from information theory to compute distances between texts. Two methods were employed in this approach: one based on Markov chains and another based on text compression. In the second approach, texts are represented as vectors of frequencies of linguistic elements and are then classified using machine learning algorithms. The features tried in this work were lexical (prominent words, which can be selected in different ways) and morphological (parts of speech, or other morphological characteristics such as the construct state). The machine learning tool used was SVM.

The second approach performed better, reaching an accuracy of over 98% on the literature corpus and 74% on the blog corpus. It was also less affected by variations in the topic of the texts. The first approach, which does not separate content from style, failed to classify texts written by authors on subjects that are atypical for them. Neither approach performed well on very short texts. Attempting to solve this by choosing only features that occur in the anonymous texts did not yield good results. Combining lexical and morphological features considerably improved the classification, even though the morphological disambiguation tool that was used is not perfect. The relatively simple measure of segmenting the prefixes had a positive effect in every variation of the second approach. The study has demonstrated that authorship attribution in Hebrew is achievable for texts of sufficient length and that style identification requires morphological analysis.
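
The compression-based method in the first approach can be illustrated with a minimal sketch. The abstract does not say which compressor or distance formula the thesis used, so the normalized compression distance and the `zlib` compressor below are assumptions. The function names `compressed_size`, `ncd` and `attribute` are hypothetical as well. The idea is that an unknown text compresses better together with samples from its true author than with samples from other candidates.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (an assumed choice of distance).

    Small values mean that y adds little new information once x is known.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(unknown: str, candidates: Mapping[str, str]) -> str:
    """Return the candidate whose sample text is closest to the unknown text."""
    return min(candidates, key=lambda author: ncd(candidates[author], unknown))


if __name__ == "__main__":
    # Toy samples for illustration only; not data from the thesis.
    samples = {
        "A": "the quick brown fox jumps over the lazy dog " * 20,
        "B": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    unknown_text = "the quick brown fox jumps over the lazy dog " * 5
    print(attribute(unknown_text, samples))
```

Because such a distance is computed on raw characters, it captures topic vocabulary as well as style, which is consistent with the abstract's observation that the first approach does not separate content from style.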

Similar resources

Tools to Aid OCR of Hebrew Character Manuscripts A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

Attitude Change and Self - Attribution of Responsibility

of Dissertation Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy SELF-ATTRIBUTION OF RESPONSIBILITY AND ATTITUDE CHANGE AS FUNCTIONS OF THE ATTRIBUTIONS OF OTHERS

A survey of modern authorship attribution methods

Authorship attribution supported by statistical or computational methods has a long history starting from 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially taking advantage of research advances in areas such as machine learning, informati...

The mathematics of the epicycloid : a historical journey with a modern perspective

OF THESIS Submitted in Partial Fulfillment of the Requirements for the Degree of

The effectiveness of culture-based assertiveness training on the self-esteem of children of divorce

Brever, M.M. (2010). The effects of child gender and child age at the time of parental divorce on the development. College of Social and Behavioral Sciences, Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy, Psychology, Educational Track.

Publication date: 2008